The goal of this project is to predict whether a savings customer will take out a credit. We have several sources of data, including savings account transactions, ZIP codes, ATM geographical and transactional information, and open data on crime and sociodemographic areas in Mexico.
In this document we present the analysis that led to the computation of four models. Banco Azteca (BAZ) has around 12 million savings customers and 800 thousand customers with both credit and savings, of which we have a sample of 1 million savings customers. The analysis in this document is based on this sample and on the whole population of credit customers.
| Quantile | abonos | abonos_monto | retiros | retiros_monto | num_meses | tiempo_meses | freq |
|---|---|---|---|---|---|---|---|
| 0% | 0 | 0.000 | 0 | -39804335.20 | 1 | 1 | 0.0312500 |
| 5% | 1 | 1.000 | 0 | -135549.10 | 1 | 6 | 0.0526316 |
| 10% | 1 | 50.000 | 0 | -80000.00 | 1 | 8 | 0.0714286 |
| 15% | 1 | 150.000 | 1 | -55267.97 | 2 | 11 | 0.0967742 |
| 20% | 1 | 554.000 | 1 | -40600.00 | 2 | 13 | 0.1250000 |
| 25% | 2 | 1212.975 | 2 | -30724.68 | 3 | 14 | 0.1428571 |
| 30% | 2 | 2075.000 | 2 | -23850.00 | 3 | 16 | 0.1666667 |
| 35% | 3 | 3200.000 | 3 | -18616.00 | 4 | 18 | 0.2000000 |
| 40% | 3 | 4636.000 | 4 | -14750.00 | 4 | 19 | 0.2222222 |
| 45% | 4 | 6200.000 | 5 | -11500.00 | 5 | 21 | 0.2500000 |
| 50% | 5 | 8350.000 | 7 | -9006.00 | 5 | 23 | 0.2812500 |
| 55% | 6 | 10800.000 | 8 | -6940.00 | 6 | 25 | 0.3125000 |
| 60% | 8 | 14000.000 | 10 | -5170.00 | 7 | 27 | 0.3333333 |
| 65% | 9 | 17850.000 | 12 | -3897.79 | 8 | 29 | 0.3600000 |
| 70% | 11 | 22917.640 | 15 | -2650.00 | 9 | 30 | 0.3750000 |
| 75% | 14 | 29950.000 | 19 | -1650.00 | 10 | 31 | 0.3846154 |
| 80% | 18 | 39550.000 | 25 | -900.00 | 10 | 32 | 0.4210526 |
| 85% | 24 | 54000.000 | 32 | -300.00 | 11 | 32 | 0.5000000 |
| 90% | 33 | 78500.000 | 45 | 0.00 | 12 | 32 | 0.5882353 |
| 95% | 53 | 132810.063 | 72 | 0.00 | 12 | 32 | 0.7500000 |
| 100% | 60199 | 42131739.960 | 22009 | 0.00 | 12 | 32 | 1.0000000 |
freq (number of months with activity / total months): a value between 0 and 1. A value of 1 means that the customer had activity in every month available in the data; a value of 0 would mean that no activity took place. The value 0 is not possible in this database because all of these customers have at least one transaction.
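As an illustration of the ratio described above, here is a minimal sketch of how such a frequency could be computed from a customer's transaction dates. The function names and the exact definition of "total months" (from the first observed transaction month through the end of the data window) are assumptions for the example, not the definition used to build the table.

```python
from datetime import date


def month_index(d):
    """Map a date to a running month number so months can be counted."""
    return d.year * 12 + d.month


def activity_frequency(transaction_dates, window_end):
    """Share of months with at least one transaction.

    Assumption: "total months" runs from the customer's first observed
    transaction month through the month of `window_end`, inclusive.
    """
    if not transaction_dates:
        raise ValueError("at least one transaction is required")
    active_months = {month_index(d) for d in transaction_dates}
    if max(active_months) > month_index(window_end):
        raise ValueError("transactions must not fall after window_end")
    total_months = month_index(window_end) - min(active_months) + 1
    return len(active_months) / total_months


# Hypothetical customer: activity in January and March, data ends in April.
example = activity_frequency(
    [date(2014, 1, 5), date(2014, 1, 20), date(2014, 3, 2)],
    window_end=date(2014, 4, 30),
)
print(example)  # 2 active months out of 4
```

Because every customer in the database has at least one transaction, the numerator is never zero, which is why the ratio cannot reach 0.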
| credit_or_savings | electronic_banking | proportion |
|---|---|---|
| credit | 0 | 0.9509844 |
| credit | 1 | 0.0490156 |
| savings | 0 | 0.9553800 |
| savings | 1 | 0.0446190 |
| credit_or_savings | active_electronic_banking | proportion |
|---|---|---|
| credit | 0 | 0.9686378 |
| credit | 1 | 0.0313622 |
| savings | 0 | 0.9772680 |
| savings | 1 | 0.0227310 |
It can be seen that the number of people taking credits has been decreasing.
We have information about the customers’ ZIP codes. This information could be combined with publicly available information from sources like INEGI to estimate the socioeconomic level of each savings customer.
Available sources:
AGEB stands for Área GeoEstadística Básica (Basic Geostatistical Area), and a locality is a general term used by CONAPO to define several AGEBs.
This document uses information from the socioeconomic regions defined by INEGI.
ZIP code geographical information is available. According to the official postal code webpage, there are 32,448 different ZIP codes in Mexico, of which around 25,000 are available as shapefiles. The official ZIP code shapefiles are published on the government’s open data webpage, but not all of them are available yet because the Mexican postal service is still working on delimiting each code. Other resources are available, for example a non-official collection of shapefiles of neighborhoods and ZIP codes. In addition, Google’s geocoding API is a useful tool, used as a last resort to find information about some ZIP codes.
Even with all this information, one problem remains: a number of ZIP codes are not officially assigned to any human settlement but are still used by people due to tradition or misinformation. Geographic information may therefore not be available for all customers, but it will be for most of them.
The polygons defining the ZIP codes are not equivalent to the polygons defining the AGEBs, so a mapping between them is needed in order to use the publicly available information. Perhaps the simplest solution is to find the centroid of each ZIP code and of each AGEB, and then map each ZIP code to the closest AGEB centroid.
We have a classification for each AGEB that aims to show the differences among AGEBs based on indicators related to housing, education, health and employment, built from the last population census. Each AGEB is classified into one of 7 strata, such that stratum 7 contains the AGEBs with the most favorable average conditions and stratum 1 contains the AGEBs with the least favorable average conditions.
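The nearest-centroid idea can be sketched as follows. This is only an illustration: the haversine formula and the mean Earth radius are assumptions (the document does not state which distance formula was used), and the AGEB identifiers `ageb_a` and `ageb_b` are hypothetical labels for centroid coordinates.

```python
import math

# Assumption: mean Earth radius; the source does not say which value was used.
EARTH_RADIUS_KM = 6371.0


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_centroid(zip_centroid, ageb_centroids):
    """Return (ageb_id, distance_km) for the AGEB centroid closest to a ZIP centroid.

    zip_centroid: (lon, lat) tuple
    ageb_centroids: dict mapping an AGEB id to its (lon, lat) centroid
    """
    lon, lat = zip_centroid
    best_id = min(
        ageb_centroids,
        key=lambda k: haversine_km(lon, lat, *ageb_centroids[k]),
    )
    return best_id, haversine_km(lon, lat, *ageb_centroids[best_id])


# Hypothetical AGEB ids with example centroid coordinates.
agebs = {
    "ageb_a": (-98.93469, 19.44372),
    "ageb_b": (-98.94869, 19.43894),
}
best_id, dist_km = nearest_centroid((-98.93143, 19.44496), agebs)
print(best_id, round(dist_km, 2))
```

A brute-force search like this is quadratic in the number of centroids; with roughly 25,000 ZIP codes and many more AGEBs, a spatial index would likely be preferable in practice.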
In the next images, maps of Mexico City and surroundings, Monterrey and Guadalajara are shown.
Map of Mexico City with centroids of each polygon:
Now, same map for Guadalajara, Jalisco:
And finally, for Monterrey, Nuevo León:
ZIP code information with their centroids can be seen in the next map of Mexico City:
ZIP code information with their centroids can be seen in the next map of Guadalajara. Some of the centroids may not perfectly match the plotted polygon because the database treats the ZIP code and the identifier as different groups.
ZIP code information with their centroids can be seen in the next map of Monterrey:
Finally, plotting the centroids of AGEBs and ZIP codes in Mexico City altogether we get:
Guadalajara:
Monterrey:
So, for each available ZIP code, the closest AGEB centroid is found and a mapping is made to assign an AGEB to each ZIP code, such that we get a table in the following format:
| ZIP | ZIP long | ZIP lat | Nearest AGEB | AGEB long | AGEB lat | Distance (km) | Classification |
|---|---|---|---|---|---|---|---|
| 56364 | -98.93143 | 19.44496 | 1.503100e+12 | -98.93469 | 19.44372 | 0.3680725 | 3 |
| 56367 | -98.95076 | 19.44106 | 1.503100e+12 | -98.94869 | 19.43894 | 0.3201608 | 4 |
| 56365 | -98.94247 | 19.43852 | 1.503100e+12 | -98.94134 | 19.43799 | 0.1325068 | 4 |
| 96340 | -94.60759 | 18.00084 | 3.004801e+12 | -94.60721 | 18.00117 | 0.0547658 | 6 |
| 42850 | -99.33818 | 19.92243 | 1.306300e+12 | -99.33511 | 19.91824 | 0.5655460 | 6 |
| 57850 | -98.97560 | 19.38088 | 1.505800e+12 | -98.97690 | 19.38002 | 0.1661747 | 6 |
| 97300 | -89.70512 | 21.01598 | 3.110000e+12 | -89.74094 | 21.02427 | 3.8302693 | 2 |
| 61531 | -100.37365 | 19.42391 | 1.611200e+12 | -100.37496 | 19.42216 | 0.2384809 | 4 |
| 41706 | -98.41225 | 16.69447 | 1.204600e+12 | -98.40838 | 16.69271 | 0.4568835 | 4 |
| 53750 | -99.24115 | 19.45593 | 1.505700e+12 | -99.24014 | 19.45617 | 0.1088229 | 6 |
The following graph shows a histogram of the distance between the centroid of each ZIP code and the centroid of its assigned AGEB. The red lines represent quantiles 0.5, 0.75, 0.9 and 0.95. As can be seen, most of the mass is concentrated in distances shorter than 10 km. This may seem small, but within a city the landscape can change dramatically in 10 km.
In the following graph, the distance histogram is plotted once more, but with a separate panel depending on whether the ZIP code is in a rural, urban, semiurban or unknown type of area. In the urban and semiurban areas, more than 95% of ZIP codes are within 2.5 km of the closest centroid. The rural areas are the ones with the longer tail, which seems reasonable because rural areas are usually larger and AGEB information is scarce in these areas.
The following graph shows the distribution of the distance for the 4 biggest states in Mexico.
The next graph combines the data of the last two graphs: it shows the distance distribution depending on whether the area is rural, urban, semiurban or unknown and on whether the ZIP code is in any of the 4 biggest states in Mexico. Once more, in the urban and semiurban areas the distance is smaller than in rural areas.
This approach may fail in rural areas. In addition, ZIP code polygons are generally larger in area than AGEBs, so the heterogeneity within each ZIP code is ignored.
First, let’s look at the distribution of the AGEB classification across the country. Recall that 7 means the AGEB has the most favorable conditions on average and 1 means it has the least favorable.
And now, the mapping of the ZIP codes:
The distribution changed considerably. As we can see in the following graph, the original AGEBs were both urban (U) and rural (R), but the mapping contains only urban ZIP codes; this may be one reason why the distribution changed so much.
And now let’s analyze the sample with 1 million savings customers and circa 800 thousand credit customers.
Out of the 1,859,441 customers, we have a mapped ZIP code for 1,590,674 of them (about 85.5%), distributed as follows:
And now, conditioning on whether it’s a credit or savings customer:
Using information about crime reports we create four indexes that together give us a picture of the crime in the region. The indexes that we produce are:
Crime dimension: this index gives us a summarized idea of the total crime in the region.
Non-violent crime dimension: this index tells us about the number of non-violent crimes in the region.
Violent crime dimension: this index tells us about the number of violent crimes in the region.
Kidnapping dimension: this index tells us about the number of kidnappings in the region.
To build the models, some variables were computed from the transactions customers make in their savings accounts. These variables aim to reflect some kind of economic stability in the accounts, and were computed under the assumption that behavior prior to taking a credit differs from behavior the rest of the time. To capture this idea, all of them were computed over different time windows prior to the date on which a credit was taken; for customers without any credit, the last transaction date was used.
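The windowing idea can be sketched as follows for a handful of variables. The function and feature names are hypothetical, and the sign convention (deposits positive, withdrawals negative) is an assumption for the example; the reference date would be the credit date, or the last transaction date for customers without a credit.

```python
from datetime import date, timedelta
from statistics import median


def in_window(transactions, reference_date, days):
    """Transactions in the `days` days strictly before `reference_date`."""
    start = reference_date - timedelta(days=days)
    return [(d, amount) for d, amount in transactions
            if start <= d < reference_date]


def window_features(transactions, reference_date):
    """A few illustrative variables computed relative to a reference date.

    transactions: list of (date, amount) pairs; by assumption deposits are
    positive amounts and withdrawals are negative amounts.
    """
    last_30 = in_window(transactions, reference_date, 30)
    last_360 = in_window(transactions, reference_date, 360)
    deposits = [amount for _, amount in last_360 if amount > 0]
    median_deposit = median(deposits) if deposits else None
    return {
        "n_transactions_30d": len(last_30),
        "n_transactions_360d": len(last_360),
        "median_deposit_360d": median_deposit,
        # Ratio of the largest deposit to the typical deposit in the window.
        "max_to_median_deposit_360d": (
            max(deposits) / median_deposit if deposits else None
        ),
    }


# Hypothetical customer whose credit was taken on 2015-06-01.
history = [
    (date(2013, 1, 1), 900.0),    # outside both windows
    (date(2015, 1, 10), 1500.0),  # inside the 360-day window only
    (date(2015, 5, 20), 500.0),
    (date(2015, 5, 25), -200.0),
]
features = window_features(history, date(2015, 6, 1))
print(features)
```

Excluding the reference date itself from the window keeps any transaction tied to the credit from leaking into the predictors.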
Also, the geographic variables were included. So, the variables computed were:
After this, a random forest was trained using all of these variables and the importance of each variable was computed, and the results were the following:
| Variable | Importance |
|---|---|
| Number of transactions in the last 30 days | 16.020907 |
| Number of days before the minimum amount was deposited, filtering by one month | 12.153303 |
| Number of transactions in the last 360 days | 11.933149 |
| Number of days before the maximum amount was deposited, filtering by one month | 8.857325 |
| Number of days before the maximum amount was deposited, filtering by one year | 8.472480 |
| Median of overall transactions in the last year | 7.976154 |
| Ratio of the median of deposits and median of withdrawals for the last 3 months | 6.314436 |
| Ratio of the maximum deposit and the median of deposits for the last year | 6.278312 |
| Number of days before the minimum amount was deposited, filtering by one year | 6.185391 |
| Number of withdrawals in the last year | 5.785143 |
| Median of deposits for the last year | 5.757832 |
| Number of transactions in the last 3 months | 5.621182 |
| Number of deposits in the last year | 5.423741 |
| Maximum deposit in the last year | 5.349114 |
| Minimum deposit in the last six months | 5.292315 |
| Ratio of the median of the deposits and median of overall transactions | 5.260750 |
| Sum of the deposits in one year | 5.253443 |
| Ratio of maximum withdrawal and median of overall withdrawals | 5.092131 |
| Sum of withdrawals in one year | 5.082976 |
| Crime index of the area where the customer lives | 5.059758 |
The following plots show the densities of each of these variables conditioned by the response variable (1: credit, 0: savings). The vertical lines are percentiles 50, 75, 90 and 95.
The models were trained with 74,384 people, of whom 18,746 took credits in 2014 and 2015; the rest only have savings accounts. The models were logistic regression, random forest, gradient boosting and support vector machines. The following outputs show the results of each model:
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 16641 343
## 1 24 5307
##
## Accuracy : 0.9836
## 95% CI : (0.9818, 0.9852)
## No Information Rate : 0.7468
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9557
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9986
## Specificity : 0.9393
## Pos Pred Value : 0.9798
## Neg Pred Value : 0.9955
## Prevalence : 0.7468
## Detection Rate : 0.7457
## Detection Prevalence : 0.7611
## Balanced Accuracy : 0.9689
##
## 'Positive' Class : 0
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 16567 1119
## 1 98 4531
##
## Accuracy : 0.9455
## 95% CI : (0.9424, 0.9484)
## No Information Rate : 0.7468
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8466
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9941
## Specificity : 0.8019
## Pos Pred Value : 0.9367
## Neg Pred Value : 0.9788
## Prevalence : 0.7468
## Detection Rate : 0.7424
## Detection Prevalence : 0.7926
## Balanced Accuracy : 0.8980
##
## 'Positive' Class : 0
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 16464 3589
## 1 201 2061
##
## Accuracy : 0.8302
## 95% CI : (0.8252, 0.8351)
## No Information Rate : 0.7468
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4399
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9879
## Specificity : 0.3648
## Pos Pred Value : 0.8210
## Neg Pred Value : 0.9111
## Prevalence : 0.7468
## Detection Rate : 0.7378
## Detection Prevalence : 0.8986
## Balanced Accuracy : 0.6764
##
## 'Positive' Class : 0
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 16407 1396
## 1 258 4254
##
## Accuracy : 0.9259
## 95% CI : (0.9224, 0.9293)
## No Information Rate : 0.7468
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.79
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9845
## Specificity : 0.7529
## Pos Pred Value : 0.9216
## Neg Pred Value : 0.9428
## Prevalence : 0.7468
## Detection Rate : 0.7352
## Detection Prevalence : 0.7978
## Balanced Accuracy : 0.8687
##
## 'Positive' Class : 0
##
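The statistics reported in the outputs above follow directly from the four cell counts of each confusion matrix. As a minimal sketch (the function name is hypothetical), the snippet below recomputes several of them for the first matrix, taking class 0 (savings only) as the positive class, as the outputs do.

```python
def confusion_stats(tp, fn, fp, tn):
    """Summary statistics for a 2x2 confusion matrix.

    tp: reference positive, predicted positive
    fn: reference positive, predicted negative
    fp: reference negative, predicted positive
    tn: reference negative, predicted negative
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "prevalence": (tp + fn) / total,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": (accuracy - expected) / (1 - expected),
    }


# Cell counts of the first matrix above, with class 0 as the positive class.
stats = confusion_stats(tp=16641, fn=24, fp=343, tn=5307)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Because class 0 is the positive class, sensitivity measures how well savings-only customers are identified, while specificity measures how well customers who took a credit are identified.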
As we can see, all four models perform well. The random forest has the best results of all, with 95% accuracy on the test sample: 94.5% accuracy in predicting that a customer has just a savings account and 98.8% accuracy in predicting that a customer has a credit. These results sound promising, but the tests should be repeated across the four models to make sure that a direct campaign would have the expected results.